Our era of data, larger than ever and complex like chaos, requires several skills from statisticians and other data scientists. In this course I am going to learn some new skills to be able to work with data.
https://github.com/lottaimmeli/IODS-project
This week I am doing regression analysis and model validation. But first I had to wrangle the original data. That was quite difficult for me, but it was good practice, and I just need to practice a lot more.
# After wrangling the data and writing it out to a folder, I can use my own wrangled data for the further analysis...
students2014 <- read.csv(file="~/Documents/GitHub/IODS-project/Data/learning2014.csv", header = TRUE, sep=",")
# Looking at the data structure
str(students2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
head(students2014)
## gender Age Attitude Points deep stra surf
## 1 F 53 37 25 3.583333 3.375 2.583333
## 2 M 55 31 12 2.916667 2.750 3.166667
## 3 F 49 25 24 3.500000 3.625 2.250000
## 4 M 53 35 10 3.500000 3.125 2.250000
## 5 M 49 37 22 3.666667 3.625 2.833333
## 6 F 38 38 21 4.750000 3.625 2.416667
This dataset has 166 observations (students) and 7 variables: gender, age, global attitude toward statistics (Attitude), exam points (Points) and scores for three aspects of learning (deep, strategic and surface learning).
pairs(students2014[-1], col=students2014$gender)
summary(students2014)
## gender Age Attitude Points deep
## F:110 Min. :17.00 Min. :14.00 Min. : 7.00 Min. :1.583
## M: 56 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:19.00 1st Qu.:3.333
## Median :22.00 Median :32.00 Median :23.00 Median :3.667
## Mean :25.51 Mean :31.43 Mean :22.72 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:27.75 3rd Qu.:4.083
## Max. :55.00 Max. :50.00 Max. :33.00 Max. :4.917
## stra surf
## Min. :1.250 Min. :1.583
## 1st Qu.:2.625 1st Qu.:2.417
## Median :3.188 Median :2.833
## Mean :3.121 Mean :2.787
## 3rd Qu.:3.625 3rd Qu.:3.167
## Max. :5.000 Max. :4.333
I think this method is not very good; it is difficult to interpret the data like this. I need a better graph.
# Use the packages GGally and ggplot2 to get some help with the graphical overview
library(GGally)
library(ggplot2)
p <- ggpairs(students2014, mapping = aes(alpha = 0.5), lower = list(combo = wrap("facethist", bins = 20)))
p
This gives a better view of the data. All the variables (except for age, which is right-skewed, skewness > 0) are fairly normally distributed. The plot also shows the correlations between the variables. There do not seem to be very strong correlations, since the coefficients range from about -0.3 to 0.4.
#Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent) variable.
model <- lm(Points ~ gender + Attitude + stra, data=students2014)
summary(model)
##
## Call:
## lm(formula = Points ~ gender + Attitude + stra, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.7179 -3.3285 0.5343 3.7412 10.9007
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.97982 2.40303 3.737 0.000258 ***
## genderM -0.22362 0.92477 -0.242 0.809231
## Attitude 0.35100 0.05956 5.893 2.13e-08 ***
## stra 0.89107 0.54409 1.638 0.103419
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.304 on 162 degrees of freedom
## Multiple R-squared: 0.2051, Adjusted R-squared: 0.1904
## F-statistic: 13.93 on 3 and 162 DF, p-value: 3.982e-08
I chose to study the association of exam points (target variable) with gender, attitude and strategic learning (explanatory variables). Here I can see that Attitude is the only one of these three explanatory variables that is statistically significant (p-value < 0.05). The large t value (5.893) also reflects this significance.
Next I fit a regression model with only the one significant explanatory variable, “Attitude”.
model2 <- lm(Points ~ Attitude, data=students2014)
summary(model2)
##
## Call:
## lm(formula = Points ~ Attitude, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Here I see that the “Attitude” estimate is 0.35 and the p-value stays low (below 0.05, i.e. significant). In other words, when Attitude increases by 1 unit, the expected exam points increase by 0.35.
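As a quick sketch of this interpretation (assuming the fitted `model2` object from above), the predicted exam points for two hypothetical students whose Attitude scores differ by one unit differ by exactly the coefficient:

```r
# Predicted Points for two hypothetical Attitude values, 30 and 31
preds <- predict(model2, newdata = data.frame(Attitude = c(30, 31)))
diff(preds)  # equals the Attitude coefficient, about 0.35
```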
Next I need to explain and interpret the multiple R-squared of the model. R-squared is a statistical measure of how close the data are to the fitted regression line, i.e. how well the model fits my data.
R-squared (the coefficient of determination) is the percentage of the response variable's variation that is explained by the linear model:
R-squared = explained variation / total variation. R-squared is always between 0 and 100%.
Here in the summary the Multiple R-squared is 0.1906, i.e. about 19%, so my model explains only about one fifth of the variation in exam points around their mean.
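As a sketch, the reported Multiple R-squared can be reproduced by hand from the residual and total sums of squares (assuming `model2` and `students2014` from above):

```r
# R-squared = 1 - SS_res / SS_tot = explained variation / total variation
ss_res <- sum(residuals(model2)^2)
ss_tot <- sum((students2014$Points - mean(students2014$Points))^2)
1 - ss_res / ss_tot  # should match the reported 0.1906
```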
“In general, the higher the R-squared, the better the model fits my data. R-squared cannot determine whether the coefficient estimates and predictions are biased, which is why you must assess the residual plots.”
“R-squared does not indicate whether a regression model is adequate. You can have a low R-squared value for a good model, or a high R-squared value for a model that does not fit the data!”
A linear regression model makes several assumptions. With the help of the following plots I can analyse the residuals of my model and see how well the model works here, or whether there are serious problems with it.
plot(model2, which= c(2,1,5))
This plot shows that the errors/residuals have roughly constant variance: the residuals are spread equally around a horizontal line without distinct patterns. This is a good indication!
With the Q-Q plot I can check whether the residuals are normally distributed. Here the points are very close to the line except at the upper and lower tails, where there is some deviation. However, I think this is still reasonable, and I can interpret the errors as approximately normally distributed.
This plot helps me check whether my data contains outliers that are influential in the linear regression model. In my analysis there are no such influential cases: all observations are inside the Cook's distance lines (I cannot even see them).
This week I'm learning about logistic regression.
Before this analysis part I have been doing some data wrangling and prepared the data so that I am able to analyse it.
# Opening the data from the file and checking how it looks
alc <- read.csv("~lottaimmeli/Documents/GitHub/IODS-project/Data/alc.csv", header = TRUE, sep = ",")
This data is about student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features and it was collected by using school reports and questionnaires. There were originally two datasets provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por).
In my dataset I have combined the information from these two datasets. I have observations of the 382 students who were present in both the Mathematics and the Portuguese datasets. The original data had 33 variables. In addition, I have calculated new variables: ‘alc_use’, the average of weekday and weekend alcohol consumption, and ‘high_use’, which is TRUE if ‘alc_use’ is greater than 2.
You can find more information about the original data set and the variables here: https://archive.ics.uci.edu/ml/datasets/Student+Performance
dim(alc)
## [1] 382 37
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
## [36] "probability" "prediction"
summary(alc)
## school sex age address famsize Pstatus
## GP:342 F:198 Min. :15.00 R: 81 GT3:278 A: 38
## MS: 40 M:184 1st Qu.:16.00 U:301 LE3:104 T:344
## Median :17.00
## Mean :16.59
## 3rd Qu.:17.00
## Max. :22.00
## Medu Fedu Mjob Fjob
## Min. :0.000 Min. :0.000 at_home : 53 at_home : 16
## 1st Qu.:2.000 1st Qu.:2.000 health : 33 health : 17
## Median :3.000 Median :3.000 other :138 other :211
## Mean :2.806 Mean :2.565 services: 96 services:107
## 3rd Qu.:4.000 3rd Qu.:4.000 teacher : 62 teacher : 31
## Max. :4.000 Max. :4.000
## reason nursery internet guardian traveltime
## course :140 no : 72 no : 58 father: 91 Min. :1.000
## home :110 yes:310 yes:324 mother:275 1st Qu.:1.000
## other : 34 other : 16 Median :1.000
## reputation: 98 Mean :1.448
## 3rd Qu.:2.000
## Max. :4.000
## studytime failures schoolsup famsup paid activities
## Min. :1.000 Min. :0.0000 no :331 no :144 no :205 no :181
## 1st Qu.:1.000 1st Qu.:0.0000 yes: 51 yes:238 yes:177 yes:201
## Median :2.000 Median :0.0000
## Mean :2.037 Mean :0.2016
## 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :4.000 Max. :3.0000
## higher romantic famrel freetime goout
## no : 18 no :261 Min. :1.000 Min. :1.00 Min. :1.000
## yes:364 yes:121 1st Qu.:4.000 1st Qu.:3.00 1st Qu.:2.000
## Median :4.000 Median :3.00 Median :3.000
## Mean :3.937 Mean :3.22 Mean :3.113
## 3rd Qu.:5.000 3rd Qu.:4.00 3rd Qu.:4.000
## Max. :5.000 Max. :5.00 Max. :5.000
## Dalc Walc health absences
## Min. :1.000 Min. :1.000 Min. :1.000 Min. : 0.0
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:3.000 1st Qu.: 1.0
## Median :1.000 Median :2.000 Median :4.000 Median : 3.0
## Mean :1.482 Mean :2.296 Mean :3.573 Mean : 4.5
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.: 6.0
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :45.0
## G1 G2 G3 alc_use
## Min. : 2.00 Min. : 4.00 Min. : 0.00 Min. :1.000
## 1st Qu.:10.00 1st Qu.:10.00 1st Qu.:10.00 1st Qu.:1.000
## Median :12.00 Median :12.00 Median :12.00 Median :1.500
## Mean :11.49 Mean :11.47 Mean :11.46 Mean :1.889
## 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:2.500
## Max. :18.00 Max. :18.00 Max. :18.00 Max. :5.000
## high_use probability prediction
## Mode :logical Min. :0.2883 Mode :logical
## FALSE:268 1st Qu.:0.3597 FALSE:272
## TRUE :114 Median :0.4170 TRUE :110
## Mean :0.4555
## 3rd Qu.:0.5197
## Max. :0.9398
The purpose of my analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data. To do this, I choose 4 interesting variables (failures, absences, famrel and higher) and state my hypotheses.
The first variable, “failures”, gives the number of past class failures. I think there might be an association between higher alcohol consumption and more class failures.
“Absences” is the number of school absences. I am interested to find out whether those who drink more have more school absences.
“Famrel” corresponds to the quality of family relationships. Maybe those who have poor family relationships consume more alcohol?
“Higher” indicates whether the student wants to pursue higher education (yes or no). Maybe those who want to educate themselves further in the future will study more and drink less.
# Numerically exploring the relationship between high_use and failures: a table and a geom_count plot
mytable <- table(high_use = alc$high_use, number_of_past_class_failures = alc$failures)
addmargins(mytable)
## number_of_past_class_failures
## high_use 0 1 2 3 Sum
## FALSE 244 12 10 2 268
## TRUE 90 12 9 3 114
## Sum 334 24 19 5 382
library(ggplot2)
failure1 <- ggplot(alc, aes(x=high_use, y = failures))
failure1 + geom_count()
In this table and plot I can see that students with low alcohol use proportionally seem to have fewer class failures than students with high alcohol use: those who consume less alcohol more often have no past class failures.
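Because the low-use and high-use groups have different sizes, row-wise percentages make the comparison easier than raw counts; a small sketch using the `mytable` object from above:

```r
# Percentage of each high_use group at each number of failures
round(prop.table(mytable, margin = 1) * 100, 1)
# e.g. 244/268 = 91% of low users vs 90/114 = 79% of high users have 0 failures
```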
# Making a boxplot
g <-ggplot(alc, aes(x=high_use, y = absences))
g2 <- g + geom_boxplot() + ggtitle("Absences versus high alcohol use")
g2
It looks like there might be a correlation between high use of alcohol and more school absences.
# Making a boxplot
g_fam <- ggplot(alc, aes(x = high_use, y = famrel))
g_fam2 <- g_fam + geom_boxplot()
g_fam2
It seems that those with low alcohol use rate their family relationships better (higher points). But which is the cause and which the effect? Do students with bad relationships at home start to use more alcohol, or does everything go well at home until they start to use more alcohol and the relationships become more problematic?
table_plans <- table(high_use= alc$high_use, wants_high_education = alc$higher)
table_plans
## wants_high_education
## high_use no yes
## FALSE 9 259
## TRUE 9 105
round(prop.table(table_plans) * 100, 1)
## wants_high_education
## high_use no yes
## FALSE 2.4 67.8
## TRUE 2.4 27.5
It seems that those who consume more alcohol may less often plan to pursue higher education (9/114 of the high-use group have no such plans, versus 9/268 of the low-use group).
Now I want to statistically explore the relationship between my four chosen variables and the binary high/low alcohol consumption variable as the target. Here is the summary of my fitted model.
# Fit the model with glm; the first model (m) includes the intercept, and in m2 I remove the intercept
m <- glm(high_use ~ failures + absences + famrel + higher, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ failures + absences + famrel + higher,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3593 -0.7890 -0.6902 1.1782 1.8993
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.12562 0.74528 -0.169 0.866143
## failures 0.43484 0.19618 2.217 0.026653 *
## absences 0.08341 0.02272 3.671 0.000241 ***
## famrel -0.22708 0.12502 -1.816 0.069322 .
## higheryes -0.36274 0.55177 -0.657 0.510924
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 436.43 on 377 degrees of freedom
## AIC: 446.43
##
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept) failures absences famrel higheryes
## -0.12562333 0.43483926 0.08340567 -0.22707832 -0.36273561
m2 <- glm(high_use ~ failures + absences + famrel + higher - 1, data = alc, family = "binomial")
summary(m2)
##
## Call:
## glm(formula = high_use ~ failures + absences + famrel + higher -
## 1, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3593 -0.7890 -0.6902 1.1782 1.8993
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## failures 0.43484 0.19618 2.217 0.026653 *
## absences 0.08341 0.02272 3.671 0.000241 ***
## famrel -0.22708 0.12502 -1.816 0.069322 .
## higherno -0.12562 0.74528 -0.169 0.866143
## higheryes -0.48836 0.51592 -0.947 0.343857
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 529.56 on 382 degrees of freedom
## Residual deviance: 436.43 on 377 degrees of freedom
## AIC: 446.43
##
## Number of Fisher Scoring iterations: 4
coef(m2)
## failures absences famrel higherno higheryes
## 0.43483926 0.08340567 -0.22707832 -0.12562333 -0.48835893
Logistic regression model summary interpretation: as seen above, the p-value is low (< 0.05) for failures and absences, while famrel is borderline (p = 0.069). The coefficients of failures and absences are positive (failures 0.43, absences 0.08), meaning that more failures and more absences predict higher alcohol use. On the other hand, family relationship quality has a negative coefficient (-0.23), meaning that better family relationships seem to have a protective effect against high alcohol use.
Plans to pursue higher education also have a negative coefficient (-0.36) on high alcohol use, but this is not statistically significant.
Next I present and interpret the coefficients of the model as odds ratios and provide confidence intervals for them.
library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------- tidyverse 1.2.0 --
## ✓ tibble 1.3.4 ✓ purrr 0.2.4
## ✓ tidyr 0.7.2 ✓ dplyr 0.7.4
## ✓ readr 1.1.1 ✓ stringr 1.2.0
## ✓ tibble 1.3.4 ✓ forcats 0.2.0
## -- Conflicts ------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
# Compute odds ratios and confidence intervals
OR <- coef(m) %>% exp
CI <- exp(confint(m))
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.8819470 0.1995909 3.776889
## failures 1.5447147 1.0499664 2.282173
## absences 1.0869827 1.0418018 1.139025
## famrel 0.7968584 0.6230747 1.019051
## higheryes 0.6957704 0.2372414 2.121797
In this table one can see the coefficients of the model as odds ratios with their confidence intervals. The odds ratio for failures is 1.54 (CI 1.05-2.28) and for absences 1.09 (CI 1.04-1.14), meaning that each additional past class failure multiplies the odds of high alcohol consumption by 1.54, and each additional absence by 1.09. Better family relationships make high alcohol consumption less likely: each one-point improvement multiplies the odds by about 0.8.
The higher education plan is not significant, since its odds ratio is 0.70 with CI 0.24-2.12 (the value 1 is included in the CI).
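The odds ratios in the table are simply the exponentiated coefficients from the model summary; a quick self-contained check with the numbers reported above:

```r
# exp() maps log-odds coefficients to odds ratios
exp(0.43483926)   # failures -> 1.5447, as in the OR table
exp(-0.22707832)  # famrel   -> 0.7969
```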
These findings largely support my earlier hypotheses.
I'm now using the variables that had a statistical relationship with high/low alcohol consumption according to my model and exploring their predictive power.
# Predict the probability of high use and add the new variable to the alc dataset; then use the probability to predict high_use, and look at the last 10 observations
probabilities <- predict(m, type= "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
select(alc, failures, absences, famrel, high_use, probability, prediction) %>% tail(10)
## failures absences famrel high_use probability prediction
## 373 1 0 4 FALSE 0.2765114 FALSE
## 374 1 7 5 TRUE 0.3531843 FALSE
## 375 0 1 5 FALSE 0.1764851 FALSE
## 376 0 6 4 FALSE 0.2898242 FALSE
## 377 1 2 5 FALSE 0.2646186 FALSE
## 378 0 2 4 FALSE 0.2262058 FALSE
## 379 2 2 2 FALSE 0.5234763 TRUE
## 380 0 3 1 FALSE 0.3857482 FALSE
## 381 0 4 2 TRUE 0.3523118 FALSE
## 382 0 2 4 TRUE 0.2262058 FALSE
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.65445026 0.04712042 0.70157068
## TRUE 0.25392670 0.04450262 0.29842932
## Sum 0.90837696 0.09162304 1.00000000
gpred <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
gpred + geom_point()
# the training error
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) >0.5
mean(n_wrong)
}
The proportion of incorrectly classified observations (the training error) is 0.30, i.e. 30%. I think this percentage is pretty high; I can't really call my model very good when almost one third of the observations are incorrectly classified.
Then I also ran a 10-fold cross-validation, where I got approximately the same number (about 0.30, i.e. 30%) as the average proportion of wrong predictions.
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.3010471
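As a cross-check, the same training error can be read directly off the confusion table above: the off-diagonal cells are the misclassified observations.

```r
# FALSE predicted as TRUE + TRUE predicted as FALSE
0.04712042 + 0.25392670  # 0.3010471, matching loss_func
```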
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average proportion of wrong predictions in the cross-validation
cv$delta[1]
## [1] 0.2984293
The theme for this week is clustering and classification. This is something totally new for me; let's see what I will learn this week.
In this exercise we are examining the Boston dataset from R MASS package. This dataset is about housing values in suburbs of Boston. The data frame has 506 observations and 14 variables.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# Looking at the Boston dataset from the MASS package
dim(Boston)
## [1] 506 14
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
So we have 506 suburbs of Boston. Here are the explanations of the variables:
crim: per capita crime rate by town
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox: nitrogen oxides concentration (parts per 10 million)
rm: average number of rooms per dwelling
age: proportion of owner-occupied units built prior to 1940
dis: weighted mean of distances to five Boston employment centres
rad: index of accessibility to radial highways
tax: full-value property-tax rate per $10,000
ptratio: pupil-teacher ratio by town
black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
lstat: lower status of the population (percent)
medv: median value of owner-occupied homes in $1000s
# Loading the packages I might use in this exercise
library(ggplot2)
library(GGally)
library(tidyverse)
library(corrplot)
## corrplot 0.84 loaded
# Checking the summary of the Boston dataset
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Here is the summary of the data. Chas is a binary variable. The other variables, except for crim and zn, seem roughly symmetrically distributed (mean and median more or less close to each other).
Let's also make a graph.
#graphical overview
pairs(Boston)
With the pairs plot it is not easy to see any correlations. Let's study the correlations between the variables with a correlation plot.
# Making a correlation matrix and drawing a correlation plot for easier visualisation and interpretation
cormatrix <- cor(Boston)
cormatrix %>% round(digits=2)
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
corrplot(cormatrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
There is strong negative correlation (a big red circle) between dis and nox, dis and age, and dis and indus. Moving further from the Boston employment centres, the nitrogen oxides concentration goes down, as does the proportion of owner-occupied units built prior to 1940. This makes sense.
Also lower status of the population (lstat) and the median value of owner-occupied homes (medv) have a strong negative correlation: when the percentage of lower-status population gets bigger, the median value of owner-occupied homes in $1000s gets smaller. This is also understandable.
Rad and tax have a strong positive correlation: when the index of accessibility to radial highways rises, the full-value property-tax rate per $10,000 rises as well.
Let's move further with the analysis.
I standardise the dataset so that every variable has mean 0 and standard deviation 1 (note that scaling only centres and rescales the data; it does not make it normally distributed). I print the summary of the scaled dataset.
#Standardising the dataset
boston_scaled <- scale(Boston)
#printing out summaries of the scaled data
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
# boston_scaled is a matrix; convert it to a data frame for later use
class(boston_scaled)
## [1] "matrix"
boston_scaled<- as.data.frame(boston_scaled)
Now the data is scaled. As seen in the summary, all variables have mean 0 and comparable spread, so they contribute on an equal footing when the data is fitted to a model.
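A small self-contained sketch of what scale() does: it subtracts each column's mean and divides by its standard deviation.

```r
library(MASS)  # for the Boston data
x <- Boston$crim
manual <- (x - mean(x)) / sd(x)           # the standardisation formula by hand
all.equal(as.numeric(scale(x)), manual)   # TRUE: scale() does exactly this
```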
Next I will change the continuous crime rate variable in my dataset into a categorical variable. I cut the crim variable by quantiles to get the low, middle and high crime rates into their own categories. Then I drop the old crim variable from the dataset and replace it with the new crime variable.
#create a quantile vector
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# ... and create a categorical variable crime
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label= c("low", "med_low", "med_high", "high"))
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
library(dplyr)
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
summary(boston_scaled)
## zn indus chas nox
## Min. :-0.48724 Min. :-1.5563 Min. :-0.2723 Min. :-1.4644
## 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723 1st Qu.:-0.9121
## Median :-0.48724 Median :-0.2109 Median :-0.2723 Median :-0.1441
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723 3rd Qu.: 0.5981
## Max. : 3.80047 Max. : 2.4202 Max. : 3.6648 Max. : 2.7296
## rm age dis rad
## Min. :-3.8764 Min. :-2.3331 Min. :-1.2658 Min. :-0.9819
## 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049 1st Qu.:-0.6373
## Median :-0.1084 Median : 0.3171 Median :-0.2790 Median :-0.5225
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617 3rd Qu.: 1.6596
## Max. : 3.5515 Max. : 1.1164 Max. : 3.9566 Max. : 1.6596
## tax ptratio black lstat
## Min. :-1.3127 Min. :-2.7047 Min. :-3.9033 Min. :-1.5296
## 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049 1st Qu.:-0.7986
## Median :-0.4642 Median : 0.2746 Median : 0.3808 Median :-0.1811
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332 3rd Qu.: 0.6024
## Max. : 1.7964 Max. : 1.6372 Max. : 0.4406 Max. : 3.5453
## medv crime
## Min. :-1.9063 low :127
## 1st Qu.:-0.5989 med_low :126
## Median :-0.1449 med_high:126
## Mean : 0.0000 high :127
## 3rd Qu.: 0.2683
## Max. : 2.9865
dim(boston_scaled)
## [1] 506 14
Now I need to divide the dataset into train and test sets, so that 80% of the data belongs to the train set.
#Here I make the train and the test sets. I choose 80% of the observations for the train set and the rest for the test set
dim(boston_scaled)
## [1] 506 14
n <- nrow(boston_scaled)
n
## [1] 506
ind <- sample(n, size = n * 0.8)
dim(ind)
## NULL
#create the train set
train <- boston_scaled[ind,]
str(train)
## 'data.frame': 404 obs. of 14 variables:
## $ zn : num -0.487 0.585 -0.487 -0.487 -0.487 ...
## $ indus : num 1.567 -0.915 -0.18 -0.164 -0.548 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num 0.5981 -1.1106 -0.0923 -0.0664 -0.5324 ...
## $ rm : num 0.0589 0.6296 -0.825 -0.5873 0.2012 ...
## $ age : num 1.035 -1.246 0.324 0.161 -0.578 ...
## $ dis : num -0.7238 0.7625 0.0712 -0.6257 0.354 ...
## $ rad : num -0.637 -0.637 -0.637 -0.408 -0.522 ...
## $ tax : num 0.171 -0.755 -0.618 0.141 -0.719 ...
## $ ptratio: num 1.2677 0.2515 -0.0257 -0.3028 0.5286 ...
## $ black : num 0.441 0.441 0.435 -0.198 0.441 ...
## $ lstat : num -0.055 -1.031 -0.161 0.38 -0.764 ...
## $ medv : num -0.319 0.594 -0.689 -0.232 0.138 ...
## $ crime : Factor w/ 4 levels "low","med_low",..: 3 1 2 2 2 2 4 4 4 4 ...
#create the test set
test <- boston_scaled[-ind,]
str(test)
## 'data.frame': 102 obs. of 14 variables:
## $ zn : num 0.0487 0.0487 0.0487 -0.4872 -0.4872 ...
## $ indus : num -0.476 -0.476 -0.476 -0.437 -0.437 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.265 -0.265 -0.265 -0.144 -0.144 ...
## $ rm : num -0.399 -0.392 -0.563 -0.478 -0.268 ...
## $ age : num 0.615 0.509 -1.051 -0.241 0.566 ...
## $ dis : num 1.328 1.155 0.786 0.433 0.317 ...
## $ rad : num -0.522 -0.522 -0.522 -0.637 -0.637 ...
## $ tax : num -0.577 -0.577 -0.577 -0.601 -0.601 ...
## $ ptratio: num -1.5 -1.5 -1.5 1.18 1.18 ...
## $ black : num 0.329 0.441 0.371 0.441 0.256 ...
## $ lstat : num 0.6227 0.0864 0.4281 -0.6152 -0.3351 ...
## $ medv : num -0.395 -0.395 -0.0906 -0.2319 -0.4711 ...
## $ crime : Factor w/ 4 levels "low","med_low",..: 2 2 2 3 3 3 3 3 3 2 ...
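Note that sample() draws the 80% of row indices at random, so the split (and everything built on it) changes on every run. A minimal sketch of how the split could be made reproducible; the seed value 2017 is just an arbitrary choice of mine:

```r
# fix the random number generator so the same rows end up in the train set every time
set.seed(2017)
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)   # 80% of row indices, drawn without replacement
train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]
```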
Now I’m fitting a linear discriminant analysis on the train set. I use the categorical crime rate as the target variable and all the other variables as predictors. I draw the LDA (bi)plot of the model.
# fit the linear discriminant analysis on the train set. crime as the target variable and all the other variables as predictors
library(MASS)
lda.fit <- lda(formula = crime ~ ., data = train)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[,choices[1]],
         y1 = myscale * heads[,choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# drawing a plot of the lda results
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)
Then I predict the classes with the LDA model on my test data and cross tabulate the results with the crime categories from the test set.
#saving the crime categories from the test set
correct_classes <- test$crime
library(dplyr)
#removing the categorial crime variable from the test dataset
test <- dplyr::select(test, -crime)
#Predicting the values with the LDA model on the test data
lda.pred <- predict(lda.fit, newdata = test)
#cross tabulating the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 15 8 0 0
## med_low 3 19 3 0
## med_high 0 12 12 1
## high 0 0 0 29
Here I can see how well my model predicts. The model works well for the high crime rates, but it makes some errors in the other classes. The same phenomenon was visible in the train set plot.
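From the cross tabulation one can also compute the overall prediction accuracy, i.e. the share of observations on the diagonal of the confusion matrix:

```r
# overall accuracy from the confusion matrix
tab <- table(correct = correct_classes, predicted = lda.pred$class)
sum(diag(tab)) / sum(tab)
# with the table above this is (15 + 19 + 12 + 29) / 102, i.e. about 0.735
```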
Next I move towards clustering and measuring distances. I use the Euclidean distance, which is probably the most common one. K-means is an old and widely used clustering method. kmeans computes the distances automatically, but I have to choose the number of clusters myself. I tried the model with 4 clusters, but to me it seems that 3 clusters work better.
#Reloading the Boston dataset and standardising it (distance-based clustering needs the variables on comparable scales)
dim(Boston)
## [1] 506 14
scale_Boston2 <- scale(Boston)
scale_Boston2 <- as.data.frame(scale_Boston2)
#Calculating the distances with the Euclidean distance measure
dist_eu <- dist(scale_Boston2)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
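The dist() function also supports other distance measures via its method argument; for comparison, a sketch of computing the Manhattan distance on the same scaled data:

```r
# Manhattan (city-block) distance as an alternative to the Euclidean default
dist_man <- dist(scale_Boston2, method = "manhattan")
summary(dist_man)
```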
# k-means clustering
km <- kmeans(scale_Boston2, centers = 3)
# plotting the Boston dataset with clusters
pairs(scale_Boston2, col = km$cluster)
pairs(scale_Boston2[1:6], col = km$cluster)
pairs(scale_Boston2[7:13], col = km$cluster)
Next I investigate the optimal number of clusters. There are many ways to find it, but here I use the total within-cluster sum of squares (WCSS) and visualise the result with a plot.
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(scale_Boston2, k)$tot.withinss})
# visualize the results
library(ggplot2)
qplot(x = 1:k_max, y = twcss, geom = 'line')
The optimal number of clusters is where the total WCSS drops radically, and above I can see that this happens around k = 2. So the optimal number of clusters would be 2. Next I run the algorithm again with two clusters.
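kmeans() starts from random cluster centers, so the WCSS curve (and the clustering itself) can vary a little between runs. A sketch of making the result more stable by fixing the seed and taking the best of several random restarts with the nstart argument; the seed value here is an arbitrary choice:

```r
# stabilise the elbow plot: fixed seed and 20 random restarts per k
set.seed(123)
twcss <- sapply(1:k_max, function(k){
  kmeans(scale_Boston2, centers = k, nstart = 20)$tot.withinss
})
```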
Clustering and Interpretation
The big pairs plot makes it a bit difficult to interpret the results, so I also made two more focused plots to study some of the effects more closely. In the first plot (where all the variables are included) I can see that the variable chas doesn’t follow any pattern with any of the other variables. Many other pairs don’t follow any clear pattern either. However, I think I can find negative correlations between indus-dis, nox-dis and dis-lstat, and positive correlations between indus-nox, age-nox and age-lstat.
# k-means clustering
km <- kmeans(scale_Boston2, centers = 2)
# plot the Boston dataset with clusters
pairs(scale_Boston2, col = km$cluster)
pairs(scale_Boston2[1:8], col = km$cluster)
pairs(scale_Boston2[6:13], col = km$cluster)
This week I’m learning about dimensionality reduction techniques. I will be working with the “human” data. The dataset originates from the United Nations Development Programme and is about measuring the development of a country with the Human Development Index (HDI). More information about the HDI can be found here: http://hdr.undp.org/en/content/human-development-index-hdi and more about the calculation of the HDI here: http://hdr.undp.org/sites/default/files/hdr2015_technical_notes.pdf
I’ve been doing some data wrangling before the analysis, and here is a brief summary of my data.
human <- read.table(file="~lottaimmeli/Documents/GitHub/IODS-project/Data/human.csv", header = TRUE, row.names=1, sep = ",")
dim(human)
## [1] 155 8
colnames(human)
## [1] "Edu.Ratio" "Labo.Ratio" "Life.Exp" "Edu.Exp" "GNI"
## [6] "Mat.Mor" "Ado.Birth" "Rep.Per"
summary(human)
## Edu.Ratio Labo.Ratio Life.Exp Edu.Exp
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI Mat.Mor Ado.Birth Rep.Per
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
head(human)
## Edu.Ratio Labo.Ratio Life.Exp Edu.Exp GNI Mat.Mor Ado.Birth
## Norway 1.0072389 0.8908297 81.6 17.5 64992 4 7.8
## Australia 0.9968288 0.8189415 82.4 20.2 42261 6 12.1
## Switzerland 0.9834369 0.8251001 83.0 15.8 56431 6 1.9
## Denmark 0.9886128 0.8840361 80.2 18.7 44025 5 5.1
## Netherlands 0.9690608 0.8286119 81.6 17.9 45435 6 6.2
## Germany 0.9927835 0.8072289 80.9 16.5 43919 7 3.8
## Rep.Per
## Norway 39.6
## Australia 30.5
## Switzerland 28.5
## Denmark 38.0
## Netherlands 36.9
## Germany 36.9
I have data on 155 countries and 8 variables for each country. First comes the name of the country, and then there are some indicators of the country’s development.
Empowerment:
Edu.Ratio = proportion of females with at least secondary education / proportion of males with at least secondary education
Labo.Ratio = proportion of females in the labour force / proportion of males in the labour force
Rep.Per = percentage of female representatives in parliament
Health and knowledge:
Life.Exp = life expectancy at birth
Edu.Exp = expected years of schooling
GNI = gross national income per capita
Mat.Mor = maternal mortality ratio
Ado.Birth = adolescent birth rate
#looking at the structure of the data
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu.Ratio : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.Ratio: num 0.891 0.819 0.825 0.884 0.829 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth : num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Rep.Per : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
library(tidyverse)
library(dplyr)
library(ggplot2)
library(GGally)
#Also making a graphical overview of the data
gather(human) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
h <- ggpairs(human, mapping = aes(alpha=0.5), lower=list(combo =wrap("facethist", bins=20)))
h
Here can be seen a graphical overview and a summary of the data. All the variables are numeric: GNI is stored as an integer and the rest as continuous numeric variables. Only the Edu.Exp variable looks approximately normally distributed.
Let’s study the relationships between the variables with a correlation matrix.
#making a correlation matrix and drawing a correlation plot to visualise it better and for easier interpretation
library(corrplot)
library(tidyverse)
correlation <- cor(human)
correlation %>% round(digits=2)
## Edu.Ratio Labo.Ratio Life.Exp Edu.Exp GNI Mat.Mor Ado.Birth
## Edu.Ratio 1.00 0.01 0.58 0.59 0.43 -0.66 -0.53
## Labo.Ratio 0.01 1.00 -0.14 0.05 -0.02 0.24 0.12
## Life.Exp 0.58 -0.14 1.00 0.79 0.63 -0.86 -0.73
## Edu.Exp 0.59 0.05 0.79 1.00 0.62 -0.74 -0.70
## GNI 0.43 -0.02 0.63 0.62 1.00 -0.50 -0.56
## Mat.Mor -0.66 0.24 -0.86 -0.74 -0.50 1.00 0.76
## Ado.Birth -0.53 0.12 -0.73 -0.70 -0.56 0.76 1.00
## Rep.Per 0.08 0.25 0.17 0.21 0.09 -0.09 -0.07
## Rep.Per
## Edu.Ratio 0.08
## Labo.Ratio 0.25
## Life.Exp 0.17
## Edu.Exp 0.21
## GNI 0.09
## Mat.Mor -0.09
## Ado.Birth -0.07
## Rep.Per 1.00
corrplot.mixed(correlation, lower.col = "black", number.cex = .6)
Here it can be seen that the percentage of female representatives in parliament (Rep.Per) and the ratio of females to males in the labour force (Labo.Ratio) don’t seem to have strong correlations with any of the other variables.
The maternal mortality ratio (Mat.Mor) and life expectancy have a strong negative correlation: when the maternal mortality ratio gets higher, life expectancy gets lower, which makes sense. The adolescent birth rate (Ado.Birth) also has a strong negative correlation with life expectancy. A higher education ratio and GNI seem to be positively associated with life expectancy.
First I apply PCA to the non-standardised data.
Principal Component Analysis (PCA) can be performed by two slightly different matrix decomposition methods from linear algebra: the Eigenvalue Decomposition and the Singular Value Decomposition (SVD). The prcomp() function uses the SVD, which is the preferred, more numerically accurate method.
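The connection between prcomp() and the SVD can be sketched by hand: the principal component standard deviations are the singular values of the centered data matrix divided by sqrt(n - 1):

```r
# PCA via SVD on the centered (but not scaled) data, for comparison with prcomp()
X <- scale(as.matrix(human), center = TRUE, scale = FALSE)
sv <- svd(X)
sv$d / sqrt(nrow(X) - 1)   # should match prcomp(human)$sdev
```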
#Making PCA with SVD method
pca_human <- prcomp(human)
sum_pca_human<-summary(pca_human)
sum_pca_human
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
pca_pr <- round(100*sum_pca_human$importance[2, ], digits = 1)
pc_lab<-paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, choices = 1:2, cex= c(0.8,1.0), col=c("coral", "black"), xlab = pc_lab[1], ylab = pc_lab[2], main = "PCA plot of non-scaled human data")
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
Now almost all the variables are gathered in one corner and I get only one arrow. The other arrows are of zero length and indeterminate angle, so they are skipped. This happens because PCA is sensitive to the relative scaling of the original features: without scaling it treats features with larger variance as more important. Since GNI has by far the largest variance, it dominates the plot.
Standardising the data therefore seems like a good idea, so next I standardise it and run the PCA again.
#Standardise the data and make the data matrix as data frame for the further analysis.
human_scaled <- scale(human)
str(human_scaled)
## num [1:155, 1:8] 0.639 0.596 0.54 0.562 0.481 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:155] "Norway" "Australia" "Switzerland" "Denmark" ...
## ..$ : chr [1:8] "Edu.Ratio" "Labo.Ratio" "Life.Exp" "Edu.Exp" ...
## - attr(*, "scaled:center")= Named num [1:8] 8.53e-01 7.07e-01 7.17e+01 1.32e+01 1.76e+04 ...
## ..- attr(*, "names")= chr [1:8] "Edu.Ratio" "Labo.Ratio" "Life.Exp" "Edu.Exp" ...
## - attr(*, "scaled:scale")= Named num [1:8] 2.42e-01 1.99e-01 8.33 2.84 1.85e+04 ...
## ..- attr(*, "names")= chr [1:8] "Edu.Ratio" "Labo.Ratio" "Life.Exp" "Edu.Exp" ...
summary(human_scaled)
## Edu.Ratio Labo.Ratio Life.Exp Edu.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7188 Min. :-2.7378
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6425 1st Qu.:-0.6782
## Median : 0.3503 Median : 0.2316 Median : 0.3056 Median : 0.1140
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.6717 3rd Qu.: 0.7126
## Max. : 2.6646 Max. : 1.6632 Max. : 1.4218 Max. : 2.4730
## GNI Mat.Mor Ado.Birth Rep.Per
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
class(human_scaled)
## [1] "matrix"
human_scaled<- as.data.frame(human_scaled)
#Make the pca again with SVD method
pca_human_s <- prcomp(human_scaled)
sum_pca_human_s<-summary(pca_human_s)
pca_pr_s <- round(100*sum_pca_human_s$importance[2, ], digits = 1)
pc_lab<-paste0(names(pca_pr_s), " (", pca_pr_s, "%)")
sum_pca_human_var_s<-sum_pca_human_s$sdev^2
sum_pca_human_var_s
## [1] 4.2883701 1.2989625 0.7657100 0.6066276 0.4381862 0.2876242 0.2106805
## [8] 0.1038390
biplot(pca_human_s, choices = 1:2, cex= c(0.5,1.0), col=c("coral", "black"), xlab = pc_lab[1], ylab = pc_lab[2], main = "PCA plot of scaled human data")
The standardisation helps a lot. Now the variables are on a similar scale and GNI (which had the largest variance) no longer runs over the other variables.
Edu.Exp, GNI, Edu.Ratio and Life.Exp are situated together and their arrows share a small angle, meaning that these variables have a high positive correlation. The arrows of Mat.Mor and Ado.Birth point in the opposite direction, meaning that they have a high negative correlation with the features mentioned earlier. All these arrows form a large angle with Labo.Ratio and Rep.Per, meaning that the correlation there is low.
The angle between a feature and a PC axis can be interpreted as the correlation between the two: a small angle means a high positive correlation.
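The same idea applies between two feature arrows: the cosine of the angle between them in the PC1-PC2 plane approximates the correlation of the variables. A small sketch checking this for Life.Exp and Edu.Exp (only an approximation, since the first two components capture about 70% of the total variance):

```r
# cosine of the angle between the Life.Exp and Edu.Exp loading arrows (PC1-PC2 plane)
ld <- pca_human_s$rotation[, 1:2]
a <- ld["Life.Exp", ]
b <- ld["Edu.Exp", ]
sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))   # compare with cor(human)["Life.Exp", "Edu.Exp"]
```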
The lengths of the arrows are proportional to the standard deviations of the features, and here they seem to be fairly similar across the variables.
Next I will be doing some MCA. For this I need the FactoMineR package.
library(FactoMineR)
From this package I will be using the tea data. Now let’s see what the data looks like.
data(tea)
colnames(tea)
## [1] "breakfast" "tea.time" "evening"
## [4] "lunch" "dinner" "always"
## [7] "home" "work" "tearoom"
## [10] "friends" "resto" "pub"
## [13] "Tea" "How" "sugar"
## [16] "how" "where" "price"
## [19] "age" "sex" "SPC"
## [22] "Sport" "age_Q" "frequency"
## [25] "escape.exoticism" "spirituality" "healthy"
## [28] "diuretic" "friendliness" "iron.absorption"
## [31] "feminine" "sophisticated" "slimming"
## [34] "exciting" "relaxing" "effect.on.health"
summary(tea)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 15-24:92 1/day : 95
## middle :40 sportsman :179 25-34:69 1 to 2/week: 44
## non-worker :64 35-44:40 +2/day :127
## other worker:20 45-59:61 3 to 6/week: 34
## senior :35 +60 :38
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming
## feminine :129 Not.sophisticated: 85 No.slimming:255
## Not.feminine:171 sophisticated :215 slimming : 45
##
##
##
##
##
## exciting relaxing effect.on.health
## exciting :116 No.relaxing:113 effect on health : 66
## No.exciting:184 relaxing :187 No.effect on health:234
##
##
##
##
##
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
I can see that I have data on 300 observations and 36 variables. Except for the variable “age”, all the variables are categorical with 2 to 7 categories.
I also make a graphical overview of the data.
With MCA I can analyse the pattern of relationships between several categorical variables. MCA deals with categorical variables, but continuous ones can be used as background (supplementary) variables.
For the categorical variables, I can use either the indicator matrix or the Burt matrix in the analysis. The indicator matrix contains all the levels of the categorical variables as binary variables (1 = belongs to the category, 0 = doesn’t). The Burt matrix is a matrix of two-way cross-tabulations between all the variables in the dataset.
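The indicator coding can be sketched in base R with model.matrix, using ~ 0 + to keep all levels as 0/1 dummy columns. A minimal example on two of the tea variables (just an illustration of the matrices, not how MCA() is called):

```r
# indicator (dummy) coding: one 0/1 column per category level
Z <- cbind(model.matrix(~ 0 + Tea, data = tea),
           model.matrix(~ 0 + sugar, data = tea))
head(Z)
colSums(Z)     # reproduces the category counts (e.g. black 74, Earl Grey 193, green 33)
crossprod(Z)   # Burt-style matrix: two-way cross-tabulations of these two variables
```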
For the further analysis I will not use all 36 variables. Let’s choose some of them and make a summary and a graphical overview.
#Choose the columns to keep
library(dplyr)
library(tidyverse)
library(FactoMineR)
keep <- c("Tea", "How", "sugar", "sex", "age_Q", "breakfast", "work", "price")
#make a new data set with the selected columns
tea1 <- tea[, keep]
library(GGally)
library(ggplot2)
#summary and visualisation of the tea set with my chosen variables
summary(tea1)
## Tea How sugar sex age_Q
## black : 74 alone:195 No.sugar:155 F:178 15-24:92
## Earl Grey:193 lemon: 33 sugar :145 M:122 25-34:69
## green : 33 milk : 63 35-44:40
## other: 9 45-59:61
## +60 :38
##
## breakfast work price
## breakfast :144 Not.work:213 p_branded : 95
## Not.breakfast:156 work : 87 p_cheap : 7
## p_private label: 21
## p_unknown : 12
## p_upscale : 53
## p_variable :112
gather(tea1) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust =1, size =8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
# multiple correspondence analysis
mca <- MCA(tea1, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea1, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.219 0.185 0.182 0.162 0.154 0.138
## % of var. 9.737 8.231 8.068 7.191 6.831 6.137
## Cumulative % of var. 9.737 17.968 26.036 33.228 40.058 46.195
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11 Dim.12
## Variance 0.135 0.129 0.127 0.118 0.109 0.109
## % of var. 6.009 5.712 5.653 5.231 4.866 4.835
## Cumulative % of var. 52.203 57.916 63.568 68.799 73.666 78.501
## Dim.13 Dim.14 Dim.15 Dim.16 Dim.17 Dim.18
## Variance 0.098 0.092 0.089 0.071 0.068 0.065
## % of var. 4.366 4.102 3.947 3.173 3.035 2.876
## Cumulative % of var. 82.867 86.969 90.916 94.089 97.124 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | 0.135 0.028 0.004 | 0.235 0.099 0.012 | 0.882 1.428
## 2 | 0.561 0.479 0.162 | 0.751 1.014 0.290 | 0.022 0.001
## 3 | -0.086 0.011 0.005 | 0.591 0.629 0.239 | -0.418 0.321
## 4 | -0.462 0.325 0.192 | -0.469 0.395 0.198 | -0.087 0.014
## 5 | 0.122 0.023 0.011 | 0.262 0.124 0.052 | -0.029 0.002
## 6 | -0.291 0.129 0.033 | -0.279 0.140 0.031 | -0.188 0.065
## 7 | 0.061 0.006 0.002 | 0.309 0.172 0.058 | 0.143 0.038
## 8 | 0.507 0.392 0.115 | 0.556 0.556 0.138 | 0.057 0.006
## 9 | 0.408 0.254 0.069 | 0.340 0.208 0.048 | 0.615 0.695
## 10 | 0.808 0.993 0.280 | 0.220 0.087 0.021 | 0.407 0.304
## cos2
## 1 0.163 |
## 2 0.000 |
## 3 0.120 |
## 4 0.007 |
## 5 0.001 |
## 6 0.014 |
## 7 0.012 |
## 8 0.001 |
## 9 0.156 |
## 10 0.071 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 1.097 16.950 0.394 10.859 | 0.336 1.881 0.037
## Earl Grey | -0.532 10.397 0.511 -12.360 | 0.056 0.135 0.006
## green | 0.652 2.666 0.052 3.962 | -1.080 8.663 0.144
## alone | -0.081 0.243 0.012 -1.908 | -0.127 0.712 0.030
## lemon | -0.121 0.092 0.002 -0.736 | -0.564 2.366 0.039
## milk | 0.053 0.034 0.001 0.471 | 0.566 4.545 0.085
## other | 1.829 5.723 0.103 5.561 | 0.866 1.519 0.023
## No.sugar | 0.462 6.300 0.228 8.265 | 0.403 5.664 0.174
## sugar | -0.494 6.734 0.228 -8.265 | -0.431 6.055 0.174
## F | -0.049 0.080 0.003 -1.015 | 0.288 3.323 0.121
## v.test Dim.3 ctr cos2 v.test
## black 3.326 | 0.337 1.930 0.037 3.335 |
## Earl Grey 1.296 | -0.085 0.317 0.013 -1.965 |
## green -6.566 | -0.261 0.516 0.008 -1.587 |
## alone -3.002 | -0.349 5.462 0.227 -8.232 |
## lemon -3.431 | 0.555 2.333 0.038 3.374 |
## milk 5.048 | 0.781 8.825 0.162 6.965 |
## other 2.634 | 0.066 0.009 0.000 0.200 |
## No.sugar 7.205 | -0.319 3.620 0.109 -5.703 |
## sugar -7.205 | 0.341 3.870 0.109 5.703 |
## F 6.016 | -0.559 12.787 0.457 -11.685 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.526 0.158 0.040 |
## How | 0.107 0.135 0.241 |
## sugar | 0.228 0.174 0.109 |
## sex | 0.003 0.121 0.457 |
## age_Q | 0.550 0.332 0.313 |
## breakfast | 0.000 0.173 0.055 |
## work | 0.097 0.325 0.056 |
## price | 0.241 0.063 0.181 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")
Interpretation of the MCA: The distance between the different points gives a measure of their similarity (or dissimilarity). Younger people (age categories 15-24 and 25-34) like to have tea with sugar and at times other than breakfast, while people aged 35-44 and 45-59 prefer tea without sugar.